The objective of the study was to explore the appropriateness of several
statistical techniques in developing predictive models of consumers' long
distance telephone expenditure based on the analysis of socioeconomic and
demographic characteristics. Specifically, the paper examines the relative
efficacy of stepwise multiple regression, monotonic AID (Automatic Interaction
Detector) and free AID in analyzing large scale data.

Description of MRIS Data Bank

Most consumer behavior research to date has been ad hoc, fragmentary,
and exploratory. Only recently have large corporations begun to generate
continuous and systematic information about the market place as part of their
marketing information systems. The Bell System Companies provide communication
services in the 48 continental states and the District of Columbia to 104
million telephones, 83% of the total telephones in the United States. The
need to understand this enormous market is self-evident, not only from a
traditional marketing view but also from the social and economic considerations
inherent in the management of a regulated utility. To help meet this
need, a large scale Market Research Information System (MRIS) has been
established, consisting of a national longitudinal panel of some 60,000
customers representing both the business and residence markets.

MRIS panel members were selected by multistage stratified sampling procedures
from the customer files of each of the one hundred accounting offices
of the Bell System where customer billing is performed. A panel of 600 customers,
evenly divided between business and residence customers, was selected from
each of these accounting offices. Currently, the MRIS data bank contains more
than 126 million card image records, and is growing at the rate of 3 1/2
million records each month. The MRIS panel excludes certain types of accounts,
such as the U.S. Government, which are handled separately from a communications
view point, and certain specialized types of communications services such as
private line and data services.

For  each  panel  member,  the  MRIS  data  bank  stores  the  following  information: 

1. A basic equipment record consisting of service and equipment data,
such as the number and type of telephone lines, number of extension
phones and other vertical (optional) services including Princess*
and Trimline* phones, Touch-Tone* service and additional Directory
listings. These data are updated whenever panel members change
their service or equipment.

*Registered trademark of the Bell System.


2. A billing amount record listing charges for local service,
additional message units (where applicable), a summary of
long distance billing, taxes and other charges or credits
as shown on the customer's bill. This record is expanded
every month.

3. A long distance record listing billing details for each message
found on the customer's billing statement, such as the date and
time of the call, type of call (direct dial or operator handled),
length of conversation and amount of charge. This record is also
expanded each month.

4. A demographic record containing a socioeconomic and demographic
profile of the residence customer's household unit. These data have been
obtained from a mail questionnaire, and include age, sex, education
and occupation of the head of household, family size and composition,
its mobility characteristics and family income. The residence profile
is updated every three years, and consideration is presently being
given to collecting additional information on the residence customer's
fundamental value system as well as his generalized attitudes toward
the telephone as a means of communication.

To be able to comprehend this enormous amount of information at the
microlevel of the individual customer, basic research is underway to develop
analytic strategies in the interest of building predictive micro and macro
models of buyer behavior for the telephone industry.

This paper represents one of our projects designed to develop an
understanding of the long distance telephone behavior of the residence customer. The
relationship of socioeconomic and demographic factors to long distance behavior
is especially important to insure that the rate structures filed by the
Telephone Companies and approved by the regulatory agencies are equitable to
the various socioeconomic customer segments in the country3.

In order to examine the characteristics of the data and the associated
problems in their analysis, the study was limited to the 793 panel customers
from two Northeastern states. The two groups of customers were chosen on the
basis of comparability of long distance expenditure, providing a sample size
that would not unduly favor one particular analytic technique. The focus of
this paper is, however, data analysis and not inference, although decision
and predictive models are being built, based on the findings from this type of
data analysis utilizing larger and more generalized samples of the population.

Table 1 lists the fourteen socioeconomic and demographic variables and the
dependent variable, long distance expenditure, with their means and standard
deviations. To avoid the effects of seasonality and holidays, the long distance
behavior variable was based on the monthly average of a year's history for each
residence customer, expressed in dollars and cents. The dollar signs have been
omitted for the sake of simplicity. Among the fourteen socioeconomic and
demographic variables, two index variables have been included: the socioeconomic
status (SES) index and the Life Cycle index. The SES index is a score developed
from a composite of the education and occupation of the head of household and
family income level using the procedures of the U.S. Bureau of the Census (1963).
The Life Cycle index is determined from the age and marital status of the head
of household and family composition, following the procedures used by the Survey
Research Center at the University of Michigan (Lansing & Kish, 1957).


TABLE 1
List of Variables

                                                         Standard
Number   Description                           Mean      Deviation

   1     Socio-Economic Status                 6.385       2.079
   2     Own/Rent                              1.267       0.443
   3     Type of Residence                     1.504       0.803
   4     No. of Floors                         1.90?       0.790
   5     No. of Rooms                          5.974       1.747
   6     Length of Residence                   4.165       1.662
   7     No. of Moves (in past 5 yrs.)         1.40        0.788
   8     Sex of H. H.                          1.187       0.390
   9     Age of H. H.                          5.009       1.383
  10     Occupation of H. H.                   6.095       2.322
  11     Education of H. H.                    4.276       1.789
  12     Family Income                         4.511       1.801
  13     Family Size                           3.279       1.629
  14     Life Cycle                            4.764       1.585
  15     Long Distance Expenditure
         (average month)                       7.219      10.376


Problems of Data Analysis and Alternative Statistical Approaches

In Figure 1, long distance expenditure is plotted for each socioeconomic
and demographic variable. The variables have been grouped in four categories
derived from a prior factor analysis of these data. Examining the plots
clearly points out the following data problems typical of most survey research
(Morgan & Sonquist, 1963; Carman, 1967; Sonquist, 1970):

1. All of the demographic variables are discrete rather than
continuous, although many of them do have successive class
intervals containing large numbers of observations.

2. The variables have a mixture of scales consisting of nominal
and interval-scaled data.

3. The relationship of long distance expenditure with many of
the demographic variables is not linear.

4. In some cases, the relationship is not even monotonic.

5. The demographic variables may be related to long distance
telephone behavior in an interactive manner rather than in a simple
additive manner.

6. The demographic variables tend to be correlated with one another,
which may be a serious problem when using regression analysis
(Blalock, 1963). Table 2 summarizes some of the highly
multicollinear variables.

TABLE 2
Multicollinearity Among Demographic Variables

Variable                     Correlation    Variable

1. SES Score                    0.?5        Occupation of H. H.
                                0.76        Education of H. H.
                                0.78        Income of H. H.
2. Own/Rent                     0.69        Type of Residence
                               -0.56        No. of Rooms
3. Type of Residence            0.54        No. of Rooms
4. No. of Floors                0.52        No. of Rooms
5. Length of Residence         -0.71        No. of Moves
                                0.50        Age of H. H.
6. Age of H. H.                 0.81        Life Cycle
7. Occupation of H. H.          0.51        Family Income
8. Education of H. H.           0.51        Family Income

Starting with a large body of empirical data and little or no prior theory
with which to develop functional relationships, it is difficult to construct
a multivariate model without first investigating the effects of these data
problems as they relate to analysis by standard statistical methods. In fact,
the authors believe that the blind use of a statistical method may produce great
harm from misinterpretation of the data, and could even result in throwing out
important data as irrelevant or useless for a marketing problem.


Our objective was, therefore, to first explore several analytic strategies
which take into consideration, to varying degrees, these data problems. Stepwise
multiple regression, AID with the monotonic restriction on the predictor
variables, and AID without the monotonic restriction were chosen as the three
alternative strategies because each responds somewhat differently to these data
characteristics. Such a combination should, therefore, both avoid the problem
of forming an unsupportable advance hypothesis and give the analyst perspective
on the actual structure under observation.


Multiple regression is a robust method, rich in both data analysis and
inference. In addition, regression analysis, with some variations, can take
into account the problem of mixed scales and class interval data. For example,
by using dummy variables it is possible to include a number of nominally
scaled demographic descriptors such as ownership of residence and sex of the
head of household. Finally, stepwise multiple regression considers the
problem of multicollinearity by developing partial correlations at each step,
thereby eliminating variables that are highly correlated with variables already
included in the regression equation.5 However, since the regression equation
is a linear additive model, it is not capable of effectively handling the
problems of nonlinear, nonmonotonic and interactive relationships6.
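
The dummy variable device mentioned above can be illustrated with a short Python sketch. It is offered only as an illustration: the function name dummy_code, the choice of the first category as the omitted reference level, and the own/rent codes are assumptions made here, not details taken from the study.

```python
def dummy_code(values):
    """Code a nominal variable as K-1 zero/one indicator columns.

    The first category in sorted order is treated as the omitted
    reference level (an illustrative convention, not the paper's).
    """
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values]
            for level in levels[1:]}


# Hypothetical codes for five households: 1 = own, 2 = rent.
own_rent = [1, 2, 2, 1, 1]
print(dummy_code(own_rent))
```

A two-category descriptor such as own/rent therefore enters the regression as a single indicator column, and a K-category descriptor as K-1 columns.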

The objective in AID analysis is to partition the total sample into an
optimal set of nonoverlapping subgroups, developed from the profiles of the
predictor variables, whose categories explain more of the variation in the
dependent variable than do any other set of subgroups. This objective is
achieved by a sequential partitioning of the total sample into two subgroups
based on the split of a single predictor variable which produces the largest
ratio of between sum of squares to total sum of squares. This process is
repeated on each of the subgroups until some minimum level of explained
variance is encountered or a minimum sample size is reached in the subgroup.
Thus, one-way analysis of variance is explicitly included in the analysis.

Because the splitting of groups is sequential, the AID analysis is a stepwise
procedure similar to the stepwise regression method, and so minimizes
the problem of multicollinearity. However, the optimal split at each step is
not based on a predictor's contribution to reducing the error variance in the
total sample but the variance in the subgroup.
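
The sequential partitioning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the AID program itself: the data layout (a list of (x, y) pairs with x a dictionary of ordered integer category codes), the function names, the restriction to order-preserving splits, and the measurement of the minimum explained variance against the total sum of squares are all choices made here for illustration. The default stopping values of 30 cases and 0.6 percent are the ones used later in this paper.

```python
from statistics import mean


def sum_sq(ys):
    """Sum of squared deviations of ys about their own mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(data, predictors, min_size):
    """Return (between SS, predictor, cut) for the best two-way split, or None."""
    best = None
    for p in predictors:
        # Every boundary between adjacent observed categories is a candidate.
        for cut in sorted({x[p] for x, _ in data})[:-1]:
            left = [y for x, y in data if x[p] <= cut]
            right = [y for x, y in data if x[p] > cut]
            if len(left) < min_size or len(right) < min_size:
                continue
            # Between SS = parent SS minus the two within-group SS.
            bss = sum_sq(left + right) - sum_sq(left) - sum_sq(right)
            if best is None or bss > best[0]:
                best = (bss, p, cut)
    return best


def aid(data, predictors, total_ss, min_size=30, min_share=0.006, path=()):
    """Split recursively; return final groups as (path, size, mean of y)."""
    split = best_split(data, predictors, min_size)
    if split is None or split[0] / total_ss < min_share:
        return [(path, len(data), mean(y for _, y in data))]
    _, p, cut = split
    left = [(x, y) for x, y in data if x[p] <= cut]
    right = [(x, y) for x, y in data if x[p] > cut]
    return (aid(left, predictors, total_ss, min_size, min_share,
                path + (f"{p}<={cut}",))
            + aid(right, predictors, total_ss, min_size, min_share,
                  path + (f"{p}>{cut}",)))
```

Each split is chosen within the subgroup at hand, which is the point made in the paragraph above: a predictor is judged by the variance it removes in that subgroup, not in the total sample.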

Several researchers have used the AID technique in marketing, either for
developing segments (Assael, 1970) or for model building (Carman, 1967;
Armstrong & Andress, 1970). The technique itself is described by Morgan and
Sonquist (1963), Sonquist and Morgan (1964), and Sonquist (1970). The AID
program is capable of handling both the categorical and class interval predictor
variables, regardless of whether their relationship is linear, nonlinear or
nonmonotonic with respect to the criterion variable. Of course, the criterion,
or dependent, variable may be continuous and in the case of long distance
expenditure it is. Finally, and most importantly, this procedure is capable
of handling both the additive and the interactive relationships of a set of
predictors with the criterion variable.

Two types of AID analyses were used to separately examine the nonmonotonic
and the interactive effects of the relationships between long distance
expenditure and the demographic variables. The first, monotonic AID analysis,
preserves the ordinality of the predictor variable when it is chosen as a
candidate to split the sample. Therefore, the two new subgroups are defined
as above and below the boundary of a category interval of the predictor
variable. For example, given the eight categories of income, there are only
seven (K-1) comparisons possible by splitting the group at each of the adjacent
categories. By definition, therefore, monotonic AID analysis is capable of
handling nonlinear relationships as long as they are monotonic or
order-preserving.

The second procedure, free AID analysis, allows the split on a predictor
variable without regard to the order of the categories of that variable. Thus,
there is a much larger number of combinations of the predictor variable categories
which may be examined to split the sample. This removes the nonlinear restriction
and allows for the analysis of a nonmonotonic relationship if one exists between
the predictor and the criterion variable.
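
The difference in the number of candidate splits can be made concrete with a small Python sketch. It enumerates the two-group partitions exhaustively; whether the AID program searches them this way or by some shortcut is not stated here, so the enumeration should be read as an illustration only, and the function names are hypothetical. For K ordered categories there are K-1 order-preserving splits, while the number of unrestricted two-group partitions is 2^(K-1) - 1.

```python
from itertools import combinations


def monotonic_splits(categories):
    """Order-preserving splits: cut between adjacent categories."""
    cats = list(categories)
    return [(cats[:i], cats[i:]) for i in range(1, len(cats))]


def free_splits(categories):
    """All partitions of the categories into two nonempty groups."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Keep the first category on the left so each partition is listed once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = [first, *extra]
            right = [c for c in rest if c not in extra]
            splits.append((left, right))
    return splits


income_categories = range(1, 9)  # the eight income categories
print(len(monotonic_splits(income_categories)),
      len(free_splits(income_categories)))
```

With eight income categories this gives 7 candidate splits under the monotonic restriction against 127 without it.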


Comparative Data Analysis and Results


The stepwise linear regression analysis was performed using the UCLA
Biomedical computer program BMD 02R (Dixon, 1971). To avoid highly collinear
variables and random effects, an F value of 3.65, comparable to 0.05 level
of significance, was set for a predictor variable to enter into the equation.
The results of the stepwise regression analysis are summarized in Table 3.
The multiple R was 0.36, resulting in 12.68 percent of the variance in long
distance expenditure being explained by four of the demographic predictors.
These four significant predictors and their associated explained variances are
(1) family income, 9.86%, (2) number of rooms, 0.75%, (3) length of residence, 0.98%,
and (4) life cycle of the family, 1.09%. The relationship is positive with
income, number of rooms and life cycle, but negative with length of residence.
In short, the greater the income, the more rooms in the residence unit, the
later the stage of the life cycle, and the more recent the move of a
residence customer, the greater the average long distance expenditure will
be. It is interesting to note that life cycle as an index variable performed
better than its component demographic variables but the SES index did not
perform better than income.


TABLE 3
Stepwise Linear Regression Analysis

Variables in Equation

                                            Standard    Multiple             F Ratio
Step   Variable               Beta Coef.    Error       R         RSQ        to Enter

       (Constant               0.0001)
  1    Income                  0.2699       0.0387      0.3140    0.0986     86.50
  2    No. of Rooms            0.1335       0.0393      0.3257    0.1061      6.62
  3    Length of Residence    -0.1413       0.0391      0.3404    0.1159      8.74
  4    Life Cycle              0.14?2       0.0389      0.3561    0.1268      9.88

Analysis of Variance

Source of     Degrees of    Sum of      Mean
Variation     Freedom       Squares     Square      F Ratio

Total            792        85,264
Regression         4        10,812      2,703.05    28.61*
Residual         788        74,452         94.48

Percent Variance Explained 12.68
*Significant at the 0.01 level.

Several questions are implicit in these findings. First, the bulk of
the variance explained is concentrated in income (77.7%) with relatively little
contribution from the other demographic variables. Secondly, the total amount
of explained variance is relatively lower than might be expected7. And third,
most other demographic variables fail to exhibit any relationship on a partial
correlation basis. There are obviously two answers. First, no relationship may
exist with these other variables, so that long distance expenditure is determined
by other factors not in the equation. However, and secondly, it is possible
that the linear additive model built into the regression suppresses any
nonlinear or interactive relationship between the demographic variables and long
distance telephone behavior. Unless the latter explanation is ruled out,
good data may be discarded due to inappropriate analytic methods.
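
The explained variances quoted above follow from the cumulative RSQ column of Table 3, and the F ratios to enter can be recovered from the same column. The short Python sketch below shows the arithmetic. It uses the standard partial F formula for one added predictor and the four-decimal RSQ values as read from Table 3; the function name is hypothetical, and the recomputed F values can be expected to agree with the printed ones only to within rounding.

```python
# Cumulative R-squared after each step, as read from Table 3.
N_CASES = 793
STEPS = [("Income", 0.0986), ("No. of Rooms", 0.1061),
         ("Length of Residence", 0.1159), ("Life Cycle", 0.1268)]


def step_contributions(steps, n_cases):
    """Return (variable, gain in R-squared, F to enter) for each step."""
    rows, previous = [], 0.0
    for k, (name, rsq) in enumerate(steps, start=1):
        gain = rsq - previous
        # Partial F for one added predictor, k predictors now in the model.
        f_enter = gain / ((1.0 - rsq) / (n_cases - k - 1))
        rows.append((name, gain, f_enter))
        previous = rsq
    return rows


rows = step_contributions(STEPS, N_CASES)
for name, gain, f_enter in rows:
    print(f"{name:20s} gain = {100 * gain:5.2f}%   F to enter = {f_enter:6.2f}")

# Share of the total explained variance that is due to income alone.
income_share = rows[0][1] / STEPS[-1][1]
print(f"Income share of explained variance = {100 * income_share:.1f}%")
```

The increments reproduce the 9.86, 0.75, 0.98 and 1.09 percent figures, and income accounts for a little under 78 percent of the 12.68 percent explained.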

The monotonic AID analysis was the second method used. This procedure allows
for monotonic, nonlinear and interactive relationships between the predictor
variables and the criterion variable as noted. To avoid unstable results and
to meet sampling error requirements, the AID analysis was based on the additional
constraints of a minimum sample size of 30 in each final subgroup, and a minimum
percent variance explained equal to or greater than 0.6 percent at each step.

The statistical results are summarized in Tables 4 and 5. The explained
variance was increased from 12.68 percent, using regression analysis, to 16.04
percent using AID and allowing monotonic and interactive relationships. Table 4
shows that the SES score, the education and age of the head of household and
number of moves have entered into the analysis. The additional explanatory
power comes from (1) the inclusion of these variables and (2) the increases
in the predictive power of the variables as against the regression equation. The
best examples of increased predictive power are the SES score, which was 1.34%,
and the number of rooms which increased from 0.75% to 2.69% variance explained.
These values are the summation of individual percent variance explained in
Table 4. This increased predictive power can be explained by the fact that both
of these variables have a step function with the long distance expenditure as
seen from the plots in Figure 1. A similar step function in length of residence
also slightly increases its predictive power.


TABLE 5
Monotonic AID Analysis

Analysis of Variance

Source of     Degrees of    Sum of      Mean
Variation     Freedom       Squares     Square      F Ratio

Total            792        85,264
Between           15        13,674      911.61      9.89*
Within           777        71,590       92.14

Percent variance explained 16.04
*Significant at the 0.01 level

On the other hand, the predictive power of income and life cycle decreased
slightly in monotonic AID. This is largely due to the small number of cases
at the upper end of the income scale and at the lower end of the life cycle
index. These small cell sizes do not permit further subdivision due to the
restriction of the minimum size in the final groups formed.

A careful examination of the group splits which result in large between
sum of squares reveals that they are directly a function of the truncation in
the monotonic relationship of the predictor variable with the criterion
variable. Thus, the greater the rise of the step in the function between the
predictor and the criterion variable, the greater the relative predictive
power that variable possesses. Surprisingly, the interaction among the
significant demographic variables is not as great as expected. This is shown
in the tree diagram of Figure 2 by the relatively good symmetry of splits where
the predictive variables appear on both branches of the split. Stronger
interaction between the demographic and the socioeconomic variables had been
anticipated; however, this interaction does not seem to be present in the data.

Finally, a very important benefit of monotonic AID analysis comes from
the fact that it matches the managerial problem definition and decision-making
process. Typically, marketing management is interested in market differentiation
and discrimination for better marketing effectiveness. Furthermore, the market
differentiation is based on segmenting customers who are presumed to have
different wants and desires. The demographic variables are considered the
most common causal factors in bringing these differences to light, especially by
the regulatory and other governmental agencies in the utility industry. The
AID results have been represented as a diagram in Figure 2 which is both
meaningful and communicable to management. For example, it indicates that
customers with income above $15,000, with residence units consisting of eight
or more rooms, and with less than ten years of residence at their present
location have the highest average of long distance calling and expenditure
(group six). On the other hand, customers with less than $10,000 income and
an extremely low SES score manifest the lowest amount of average monthly
expenditure (group sixteen). Both of these extreme groups, as well as the
other segments, are meaningful and relate to management's prior experiences and
decisions. In view of the fact that half the problem in successful marketing
research is its effective communication, AID seems to be an advantageous
analytical strategy.

The third analytic technique is free AID analysis where the
nonmonotonicity of the predictor-criterion relationship is taken into account in
addition to the nonlinearity and interactive aspects. The same stopping
criteria were utilized here (minimum subgroup size greater than or equal
to 30 and 0.6 percent variance explained at each step). The statistical
results are summarized in Tables 6 and 7 and Figure 3.

Free AID analysis increases the predictive power of the demographic
variables from 16.04 percent to 20.23 percent when compared to monotonic AID
analysis. In the process, it includes occupation and type of residence variables;
however, the bulk of the increased predictive power in this analysis comes from
the demographic variable age of head of household, which has the greatest
nonmonotonic relationship with long distance expenditure. Somewhat smaller increases
in the predictive power of number of rooms, education, and life cycle are also
due to their nonmonotonic relationship with long distance expenditure.

TABLE 7
Free AID Analysis

Analysis of Variance

Source of     Degrees of    Sum of      Mean
Variation     Freedom       Squares     Square      F Ratio

Total            792        85,264
Between           18        17,246      958.07      10.90*
Within           774        68,018       87.88

Percent variance explained 20.23
*Significant at the 0.01 level
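
The analysis of variance entries for the three methods can be cross-checked with a few lines of Python. The sketch below is a check on the arithmetic only. The sums of squares and degrees of freedom are those of Tables 3, 5 and 7, with two readings treated as assumptions: the between sum of squares for monotonic AID is taken as the total minus the within sum of squares (85,264 - 71,590), and the within degrees of freedom for free AID as 792 - 18. The function name is hypothetical.

```python
def anova_summary(ss_between, ss_within, df_between, df_within):
    """Mean squares, F ratio and percent of variance explained."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
        "pct_explained": 100.0 * ss_between / (ss_between + ss_within),
    }


# (between SS, within SS, between df, within df) for each method.
METHODS = {
    "Stepwise regression": (10812, 74452, 4, 788),
    "Monotonic AID": (13674, 71590, 15, 777),
    "Free AID": (17246, 68018, 18, 774),
}

results = {name: anova_summary(*args) for name, args in METHODS.items()}
for name, r in results.items():
    print(f"{name:20s} F = {r['f_ratio']:6.2f}   "
          f"explained = {r['pct_explained']:5.2f}%")
```

The recomputed F ratios and percentages match the printed 28.61, 9.89 and 10.90 and the 12.68, 16.04 and 20.23 percent to within rounding of the sums of squares.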

Surprisingly, the predictive power of both the SES index and length of
residence decreased in the free AID analysis. This is attributable to the
extreme skewness of the two predictor variables.

The free AID analysis reveals some subtle differences among customer
segments which are hidden in the monotonic AID analysis. For example, the
seventh group, which has the highest average monthly bill, consists of
customers who have greater than $15,000 income, both small and large
residence units (three rooms and eight or more rooms) and who are both
relatively young and relatively old (between 25 and 34 years and between
55 and 64 years). This was not fully revealed either in monotonic AID or
in stepwise regression.


TABLE 8
Comparative Analysis of Predictor Variables

                              Percent Variance Explained

                           Stepwise      Monotonic     Free
                           Regression    AID           AID

Family Income                 9.86          9.30        9.30
No. of Rooms                  0.75          2.69        2.91
Length of Residence           0.98          1.42        0.61
Life Cycle                    1.09          0.39        0.72
SES Score                      --           1.34        0.57
Age of H. H.                   --           0.27        4.02
No. of Moves                   --           0.17         --
Education of H. H.             --           0.46        0.87
Occupation of H. H.            --            --         0.71
Type of Residence              --            --         0.52

Total                        12.68         16.04       20.23

In view of the fact that the demographic variables which explain the
variance in long distance expenditure differ for the three analytic methods,
Table 8 has been prepared to summarize their relative contribution toward
explaining variance in the criterion variable. Income, which was by far the
best predictor, does not do as well in the AID analyses, primarily due to the
problem of sample size in the extreme cells. The other three demographic
variables - number of rooms, SES score and length of residence - did very well
in monotonic AID due to two factors: (1) they have a step function
relationship with the criterion variable, and (2) the distributions are badly skewed.
Finally, age of head of household does best in free AID primarily due to its
nonmonotonic relationship with the criterion variable.

Having removed the nonlinear and nonmonotonic constraints and permitted
interactive relationships in the analysis, the explained variance has increased
from 12.68 to 20.23 percent. However, the unexplained variance still remains
quite large. An additional advantage of the AID analysis is the ability to
investigate the variance structure by customer segment. Monotonic AID and the
Free AID analyses both developed three branches defined by low, medium and high
income categories. These represent 44%, 2?% and 27% of the sample size. Table 9
summarizes the effects for each branch.


TABLE 9
Summary of Variance
(in Percentages)

                        Variance       Monotonic AID              Free AID
Income                  in each
Category                branch      Explained  Unexplained   Explained  Unexplained

Less than $10,000        13.48        1.36       12.12          1.96       11.52
$10,000 to $15,000       18.73        1.28       17.45          1.85       16.88
More than $15,000        58.49        4.10       54.39          7.12       51.37

Note: 9.30% is explained by the three income categories.
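
The entries of Table 9 are tied together by simple identities, which the following Python sketch checks. It is an arithmetic check only. The values are as reconstructed from Table 9, where several digits of the scanned table are hard to read, so they should be treated as assumptions; the names used are hypothetical. Each branch variance should equal explained plus unexplained variance, the branch variances plus the 9.30 percent due to the income categories should exhaust the total, and the explained columns plus 9.30 should reproduce the totals for the two AID analyses.

```python
INCOME_EXPLAINED = 9.30  # percent explained by the three income categories

# branch: (variance in branch,
#          (explained, unexplained) for monotonic AID,
#          (explained, unexplained) for free AID)
BRANCHES = {
    "Less than $10,000": (13.48, (1.36, 12.12), (1.96, 11.52)),
    "$10,000 to $15,000": (18.73, (1.28, 17.45), (1.85, 16.88)),
    "More than $15,000": (58.49, (4.10, 54.39), (7.12, 51.37)),
}


def total_explained(method_index):
    """Income effect plus explained variance within the branches (percent).

    method_index is 1 for monotonic AID and 2 for free AID.
    """
    within = sum(row[method_index][0] for row in BRANCHES.values())
    return INCOME_EXPLAINED + within


for name, (variance, mono, free) in BRANCHES.items():
    # Each branch's variance splits into explained plus unexplained.
    assert abs(sum(mono) - variance) < 0.011, name
    assert abs(sum(free) - variance) < 0.011, name

overall = INCOME_EXPLAINED + sum(row[0] for row in BRANCHES.values())
print(round(overall, 2), round(total_explained(1), 2),
      round(total_explained(2), 2))
```

On these figures the high income branch carries well over half of the total variance, and most of it remains unexplained under either AID analysis.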


Finally, noting that the bulk of the unexplained variance is in the high
income branch, a search of Tables 4 and 6 shows that final group 6 in the
monotonic AID analysis has some 36 percent of the total variance, and group 7 in
free AID analysis has some 30 percent. These groups represent the tail of the
skewed long distance expenditure distribution and should be evaluated further
with a larger sample size in order to determine if the socioeconomic and
demographic variables are capable of further explaining the variance in these
customer segments. Until this is completed, an ultimate judgement on the
efficacy of these variables in explaining long distance expenditure can not
be made. However, in the final analysis, it appears that approximately 30
percent of the variance can be explained by these predictor variables, which
is in line with the initial expectations when the project was undertaken.




An Approach Toward Empirical Model Building

Comparing linear regression and AID analysis, it is difficult to say which
technique is better. Each offers certain advantages that the other does not,
and each has inherent problems. AID is much more flexible in data analysis
and data handling because it requires the smallest set of assumptions. At the
same time, it is extremely compatible with the managerial viewpoint of the
market place. Therefore, it has the advantage of better communicating the
market research results. Finally, the technique brings into bold relief the
relationships among the variables, which leads the researcher to think of the
interactive effects of a set of predictor variables. This is very likely to
broaden his inductive theorizing process. On the other hand, AID is lacking
in inferential capability. It is much easier to display the data than to build
predictive empirical models with AID. This is because AID is largely based on
analysis of variance principles and therefore requires prior experimental
or matrix designs to enable any inferences to be drawn from the analysis. A
second disadvantage with AID is less parsimony in data analysis and model building;
considerable computation and search is inherent in the technique and the branching
process often is fairly complex and lengthy, which reduces its usefulness from
a pragmatic control standpoint. Finally, as was demonstrated in this paper,
AID requires large sets of observations which must be fairly well-behaved in their
distributions over the predictor and the criterion variables. In other words,
skewness presents a serious problem in AID analysis.
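
The search that AID performs at each branch can be illustrated with a minimal
sketch. It assumes the usual AID convention, which this section does not spell
out: a group is divided in two on the predictor whose split maximizes the
between-group sum of squares of the criterion variable, a "monotonic" predictor
keeps its category codes in their natural order, and a "free" predictor may have
its categories reordered by their criterion means. The function name and the
small data set are hypothetical and are not taken from the MRIS panel.

```python
from collections import defaultdict


def best_binary_split(categories, y, monotonic=True):
    """Return (between-group sum of squares, left-hand categories) for the
    best two-way split of one predictor. Illustrative sketch only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, value in zip(categories, y):
        sums[cat] += value
        counts[cat] += 1

    if monotonic:
        # Monotonic: categories stay in their natural coded order.
        order = sorted(sums)
    else:
        # Free: categories are reordered by their mean on the criterion.
        order = sorted(sums, key=lambda cat: sums[cat] / counts[cat])

    n_total, s_total = len(y), sum(y)
    best_bss, best_left = 0.0, None
    n_left, s_left = 0, 0.0
    for i, cat in enumerate(order[:-1]):
        n_left += counts[cat]
        s_left += sums[cat]
        n_right, s_right = n_total - n_left, s_total - s_left
        # BSS = n1*mean1^2 + n2*mean2^2 - N*mean^2, written with group sums.
        bss = s_left ** 2 / n_left + s_right ** 2 / n_right - s_total ** 2 / n_total
        if bss > best_bss:
            best_bss, best_left = bss, order[: i + 1]
    return best_bss, best_left


# Hypothetical data: an income code (1-4) and a monthly expenditure figure.
income_code = [1, 1, 2, 2, 3, 3, 4, 4]
expenditure = [5, 7, 6, 8, 20, 22, 4, 6]

mono_bss, mono_left = best_binary_split(income_code, expenditure, monotonic=True)
free_bss, free_left = best_binary_split(income_code, expenditure, monotonic=False)
```

In this made-up example the free split isolates the one high-spending category
and explains far more variation than the monotonic split, which is the kind of
behavior that makes a skewed tail so influential in a free AID analysis.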

Linear regression is very powerful in developing parametric models,
and provides a mechanism for establishing point and interval estimates for
predictive purposes. On the other hand, it presumes the data to be linear,
error free and additively related to the criterion variable.

In view of the fact that in a regulated industry there is a need to build
powerful predictive models, a systematic approach is necessary to develop
inductive models of telephone behavior based on large scale empirical data.
The data analysis reported in this paper, together with the following procedural
steps, is recommended for inductive model building. 8

1. Given a large scale data base, perform an initial AID analysis with as
many predictor variables as are available or can be handled by the
computer. The AID analysis will bring into bold relief the nature of
the relationships among the variables resulting from a minimum set of
restrictions with respect to the sample size of subgroups, split
reducibility criterion and the priority ordering and coding aspects of
the predictor variables. In short, a free AID analysis is recommended
in this initial phase.

2. From the initial AID analysis, predictor variables should be selected
for future analysis based on their explanatory power. The predictor
variables should then be factor analyzed to estimate the degree of
intercorrelations among them.

3. Choose a set of orthogonal predictor variables from the factor
analysis results, selecting the variable with the highest factor
loading. The problems of error in measurement should also be
considered. For example, it will generally be more advantageous
to choose the age of the husband rather than the wife if both are
loaded equally on a factor because of the possibility of response
error in the latter variable. Similarly, education would be preferred
over income. At the same time, the researcher must watch for the
possibility of creating an index variable, especially when several
predictor variables contribute to an equal, but smaller, extent toward
the eigenvalue of the factor. Such an index, by definition, would be
a linear additive index.

4. Utilizing the selected orthogonal predictor variables, the researcher
should perform a monotonic AID analysis. The restriction of monotonicity
is more appropriate for managerial decision making since it will enable
the researcher to develop models of a set of predictor variables which
are split above or below a certain level.

5. Based on the monotonic AID analysis, the predictor variables should be
defined in terms of broad categories where a split occurred. For income
in our data, this is likely to be below $10,000, between $10,000 and
$15,000, and above $15,000. In the same way, a set of interactive
predictor variables must be defined; for example, income above $15,600 and
eight or more rooms.

6. If the interactive effects are not substantial, as evidenced by the
monotonic AID analysis, the simplest procedure would be to create a
successive interval scale for each predictor variable based on AID
categorization. This may result in a dichotomous scale or a discrete
interval scale.

At this stage, a discriminant or regression model should be built in which
the redefined variables developed from the prior analysis are the predictors
and the phenomenon under investigation is the criterion variable. If the
criterion phenomenon is dichotomous or classificatory, the discriminant
model will be appropriate; however, a regression model should be used if
the criterion variable is continuous and well behaved. The regression or
discriminant model will then estimate a set of optimal weights for
predictive and inferential purposes.

It is, however, possible that the interest is in building a model which
takes into account each category of a predictor variable separately.
This is possible by converting the regression or discriminant problem
to a dummy variate analysis problem.

7. If there are strong interactions among the orthogonal predictor
variables as evidenced from the monotonic AID analysis, it will be
necessary to develop index variables based on the pattern of
interactions. This should be relatively easy in view of the fact
that the logical combinations are likely to be greatly reduced when
the stage of performing a monotonic AID analysis is reached. The
predictive model can be built from these index variables utilizing
regression or discriminant analysis.
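
The later steps above (categories defined at AID split points, an interactive
index variable, and a dummy variate regression) can be sketched as follows.
This is only an illustration under stated assumptions: the income bands follow
the cut points mentioned in step 5, the interaction is coded here as "top income
band and eight or more rooms", and the function names and the eight household
records are hypothetical rather than drawn from the study. Ordinary least
squares is solved through the normal equations with a small Gaussian
elimination so that the sketch needs no outside library.

```python
def income_band(income):
    """Collapse income into the three broad categories suggested by the splits."""
    if income < 10000:
        return "low"
    if income <= 15000:
        return "mid"
    return "high"


def design_row(income, rooms):
    """Dummy variate coding: intercept, two income dummies ("low" is the
    reference category) and one interactive index variable (assumed form)."""
    band = income_band(income)
    return [
        1.0,
        1.0 if band == "mid" else 0.0,
        1.0 if band == "high" else 0.0,
        1.0 if (band == "high" and rooms >= 8) else 0.0,
    ]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design: drop a redundant dummy variable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least squares weights from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical households: (income, rooms, long distance expenditure).
households = [
    (8000, 5, 4), (9000, 6, 6),
    (12000, 6, 9), (14000, 7, 11),
    (20000, 6, 14), (18000, 7, 16),
    (25000, 8, 24), (30000, 9, 26),
]
X = [design_row(income, rooms) for income, rooms, _ in households]
y = [spend for _, _, spend in households]
weights = fit_ols(X, y)  # [intercept, mid, high, high-and-large-home]
```

Each weight is read against the reference category: the intercept is the mean
expenditure of the low income group, the dummy weights are the increments for
the other bands, and the last weight is the extra increment attached to the
interactive index variable. If the criterion were dichotomous, the same coded
predictors would be passed to a discriminant model instead.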

To summarize, several conclusions can be drawn from these efforts at inductive
model building based on large scale data banks. First, it is extremely important
to examine the quality of the data and the nature of the relationships among the
variables. Without this critical examination, the researcher is likely to
fall prey to a statistical or mathematical model popular at the time. Most of
the recent model building in marketing has been based on management science techniques,
which clearly attests to this problem. Second, it is very unlikely that a single
statistical model such as stepwise regression, AID or discriminant analysis will
be sufficient. The authors strongly suggest that a variety of statistical tools
are sequentially necessary at various stages of inductive model building.
Finally, it is unlikely that demographic factors alone will enable the
researcher to build highly predictive models. The demographic factors, however,
seem highly useful in segmenting the total population into subpopulations which
may be the logical independent marketing segments requiring separate models.

Footnotes 


1. This study is part of ongoing empirical research on the telephone behavior
of both residence and business customers of the Bell System and was
prepared under the auspices of the Market Research Section of the
American Telephone and Telegraph Company in New York.

The authors wish to express their appreciation to Mr. N. J. Mammana, Director
of Marketing Research, for his support of the study and to Welling Howell,
who prepared and assembled the data and performed the computer analysis.

2. A. Marvin Roscoe, Jr. is a Marketing Supervisor at A.T.& T. where he
is responsible for developing analytic methodology for the Market Research
Information System. Previously he was with the Bell Telephone Company
of Pennsylvania and the Long Lines Department of A.T.& T. in various
sales and marketing positions. He has a B.S.E.E. from Rensselaer
Polytechnic Institute and an M.B.A. from the University of Pittsburgh.

Jagdish  N.  Sheth  is  presently  Professor  of  Business  and  Research  Professor 
at  the  University  of  Illinois.  Prior  to  that,  he  was  on  the  faculty  of 
Columbia  University  and  M.I.T.  He  received  his  Ph.D.  at  the  University  of 
Pittsburgh.  He  has  also  been  visiting  professor  at  the  Indian  Institute  of 
Management,  Calcutta,  and  Visiting  Lecturer  at  the  International  Marketing 
Institute,  Harvard  University.  Dr.  Sheth  is  coauthor  (with  John  A.  Howard) 
of The Theory of Buyer Behavior, author of How Advertising Works and is
a  frequent  contributor  to  business  and  scientific  journals,  especially  in 
the  area  of  marketing. 

3. The authors are well aware of the controversy with regard to the usefulness
of demographic factors in predicting consumer behavior (Yankelovich, 1964;
Frank, 1963; Bass, Tigert, & Lonsdale, 1968). However, given the regulated
nature of the utility industry, it is necessary to understand and to be
able to predict the impact of corporate strategies on different
socioeconomic segments of the population.


4. The term expenditure properly connotes the aspects of consumer buying
behavior. Increasing the consumer's level of long distance expenditure is a
very important consideration to the telephone industry. Since the
distribution channel is always available and there are frequent periods of
available capacity, increased calling during these periods can have very
obvious economic implications to society.


5. There are several other methods for handling the multicollinearity problem,
such as examination of the simple correlations or factor structure of the
correlation matrix. In fact, due to the order bias built into stepwise
regression, other methods should be used to reduce the collinearity
problem.

6. Regression theory can handle nonlinear and interactive relationships,
but these need to be developed a priori based on some theory or judgement.
Without theory to suggest rational approaches, the number of nonlinear
transformations and the combination of interactions are too many for
regression analysis to solve efficiently.

7. By setting a low F value (0.01) all of the 14 demographic variables were
permitted to enter in the final step of the regression analysis. The
predictive power increased to 14.25 percent. These results were replicated
using the UCLA BMD 03R Multiple Regression Program. Unfortunately, the
additional explanatory power has the inherent problems of instability
and multicollinearity.

8.  The  procedure  is  somewhat  different  from  the  two-stage  AID-MCA  linkage 
suggested  by  Sonquist   (1970). 